Phylogenetic Model Choice: Justifying a Species Tree or Concatenation Analysis
Abstract
There are two paradigms for the phylogenetic analysis of multi-locus sequence data: one that forces all genes to share the same underlying history, and another that allows genes to follow idiosyncratic patterns of descent from ancestral alleles. The first of these approaches (concatenation) is clearly a simplified model of the actual process of genome evolution, while the second (species-tree methods) may be overly complex for histories characterized by long divergence times between cladogenetic events. Rather than making an a priori determination concerning which of these phylogenetic models to apply to our data, we seek to provide a framework for choosing between concatenation and species-tree methods that treat genes as independently evolving lineages. We demonstrate that parametric bootstrapping can be used to assess the extent to which genealogical incongruence across loci can be attributed to phylogenetic estimation error, and illustrate the application of our approach using an empirical dataset from 10 species of the natricine snake subfamily. Since our data exhibit incongruence across loci that is clearly caused by a mixture of coalescent stochasticity and phylogenetic estimation error, we also develop an approach for choosing between species tree estimation methods that take gene trees as input and those that simultaneously estimate gene trees and species trees.

*Corresponding author: Bryan C. Carstens, Evolution, Ecology and Organismal Biology, The Ohio State University, 318 W. 12th Ave., Columbus, OH 43210, USA, Tel: 614-292-6587; E-mail: [email protected]

Received May 02, 2013; Accepted July 24, 2013; Published July 29, 2013

Citation: McVay JD, Carstens BC (2013) Phylogenetic Model Choice: Justifying a Species Tree or Concatenation Analysis. J Phylogen Evolution Biol 1: 114. doi:10.4172/jpgeb.1000114

Copyright: © 2013 McVay JD, et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

There are two primary paradigms for estimating phylogeny from multi-locus sequence data [1]. The conventional method, which developed from arguments in favor of total evidence [2], estimates phylogeny by concatenating data across multiple genes collected from exemplar samples. In this approach, the data are treated as a single locus, and the estimate of genealogy is essentially averaged across genes. Underlying this method is the intuition that phylogenetic accuracy improves with an increase in the number of variable sites [3]. While this assumption certainly holds within a particular locus, applying the method across multiple loci requires the further assumption that the gene trees across loci share a similar topology. When this is demonstrably not the case, concatenation implicitly attributes incongruence across loci to phylogenetic estimation error rather than to coalescent processes (e.g., the independent sorting of alleles across loci).

Recently, the primacy of concatenation has been challenged on several fronts [4-8], and methods that estimate phylogeny while allowing for incongruence across loci due to coalescent processes have been proposed. These coalescent-based approaches to phylogenetic inference either estimate the species tree given a set of gene trees [9,10] or estimate gene tree and species tree topologies simultaneously [11,12]. Either approach accounts for population-level processes, such as the incomplete sorting of ancestral polymorphism, that can cause gene tree discordance. Given the growing criticism of concatenation, empiricists are faced with a vexing decision regarding the choice of phylogenetic method to apply to their system.
Coalescent-based approaches are often favored a priori in phylogeographic investigations, where the incomplete sorting of ancestral polymorphism can be dramatically evident across loci [4,13-17], while concatenation continues to be favored among those working at deeper taxonomic levels [18-20]. However, it is clear that population-level processes such as the sorting of ancestral polymorphism have occurred throughout the history of life; further, one of the central theses of the modern synthesis is the expectation that evolutionary processes within populations ultimately produce phylogenetic patterns [21]. This led Edwards [1] to argue that species tree approaches are preferable on first principles.

Philosophical implications aside, the question of phylogenetic method choice is also of great practical importance because the ideal sampling schemes for concatenation and coalescent-based approaches are quite different. Since the former assumes that population-level processes do not affect phylogeny estimation, systematists who concatenate their data benefit from sampling as many genes as possible and fewer individuals per species. In contrast, coalescent-based approaches appear to be most accurate with intermediate numbers of loci and multiple individuals sampled within species [22-24]. This places empiricists in a difficult position: ideally, they need to recognize which of these approaches is appropriate for their data before all of the data are collected, so that they can employ the optimal sampling scheme. It is also the position we found ourselves in some months ago, and in this study we propose an approach to answering this question using a preliminary data set of 7 genes from 1-2 individuals for each of 10 species of thamnophiine snakes. Given our data, how should we determine which of the competing phylogenetic paradigms to employ?
Perhaps the most important evidence available to empiricists who seek to determine objectively whether to concatenate their data or use species-tree methods is the degree of incongruence among loci. If the gene trees are mostly congruent, this is evidence that the branches of the species tree are long enough to have allowed lineage sorting to reach completion, and thus concatenation may be justified. Alternatively, incongruence among gene trees may be caused by coalescent processes, which would suggest that coalescent-based methods are required. One approach would simply be to measure the incongruence across gene trees using a metric for tree comparison such as the Robinson-Foulds distance [25]. Distributions of the pairwise R-F distances can be substantial at shallow phylogenetic depths; this incongruence can
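The pairwise Robinson-Foulds comparison mentioned above can be illustrated with a minimal Python sketch. It assumes that each gene tree is written as a nested tuple of tip labels, a representation chosen here only for illustration. The function names (`leaf_set`, `splits`, `rf_distance`, `pairwise_rf`) and the taxon labels A to E are hypothetical placeholders and are not taken from the study or from any phylogenetics library. The distance is computed as the number of non-trivial bipartitions found in one unrooted tree but not the other.

```python
from itertools import combinations


def leaf_set(tree):
    """Return the frozenset of tip labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions (splits) implied by a tree.

    Each split is stored as the side that excludes a fixed reference
    taxon, so the result does not depend on where the tree is rooted.
    """
    taxa = leaf_set(tree)
    reference = min(taxa)  # arbitrary but consistent reference taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if reference in clade else clade
        # Splits with fewer than two taxa on either side are trivial.
        if 2 <= len(side) <= len(taxa) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: size of the symmetric difference of splits."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


def pairwise_rf(trees):
    """Map each pair of tree indices (i, j), i < j, to its R-F distance."""
    return {
        (i, j): rf_distance(trees[i], trees[j])
        for i, j in combinations(range(len(trees)), 2)
    }


# Hypothetical gene trees for five placeholder taxa (not the snake data).
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    (("A", "B"), ("C", ("D", "E"))),  # same unrooted topology as the first
]
print(pairwise_rf(gene_trees))
```

In practice, gene trees would be estimated from each locus and read from Newick files with an established library; the sketch only shows what the metric counts. A distribution of such pairwise distances summarizes how much incongruence exists among loci, but it does not by itself indicate whether that incongruence arises from coalescent processes or from estimation error.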
Similar resources
Estimating phylogenetic trees from genome-scale data.
The heterogeneity of signals in the genomes of diverse organisms poses challenges for traditional phylogenetic analysis. Phylogenetic methods known as "species tree" methods have been proposed to directly address one important source of gene tree heterogeneity, namely the incomplete lineage sorting that occurs when evolving lineages radiate rapidly, resulting in a diversity of gene trees from a...
Reconstructing Posterior Distributions of a Species Phylogeny Using Estimated Gene Tree Distributions
The desire to infer the evolutionary history of a group of species should be more viable now that a considerable amount of multilocus molecular data is available. However, the current molecular phylogenetic paradigm still reconstructs gene trees to represent the species tree. Further, commonly used methods to combine data, such as the concatenation method, the consensus tree method, or the gene...
Resolving the Gene Tree and Species Tree Problem by Phylogenetic Mining
The gene tree and species tree problem remains a central problem in phylogenomics. To overcome this problem, gene concatenation approaches have been used to combine a certain number of genes randomly from a set of widely distributed orthologous genes selected from genome data to conduct phylogenetic analysis. The random concatenation mechanism prevents us from the further investigations of the ...
Data Concatenation, Bayesian Concordance and Coalescent-Based Analyses of the Species Tree for the Rapid Radiation of Triturus Newts
The phylogenetic relationships for rapid species radiations are difficult to disentangle. Here we study one such case, namely the genus Triturus, which is composed of the marbled and crested newts. We analyze data for 38 genetic markers, positioned in 3-prime untranslated regions of protein-coding genes, obtained with 454 sequencing. Our dataset includes twenty Triturus newts and represents all...
Point of View Phylogenetic Analysis in the Anomaly Zone
The concatenation method has been widely used as a means of combining data to estimate phylogenetic trees (Huelsenbeck et al. 1996a, 1996b; Glazko and Nei 2003). However, simulation studies have shown that the maximum likelihood (ML) estimate of the species tree for concatenated sequences may be statistically inconsistent if the gene trees are highly heterogeneous (Kolaczkowski and Thornton 200...